📌 Title¶

Exploring Traffic Patterns on I-94: A Data-Driven Approach¶


📝 Project Description¶

In this project, I explored a dataset containing hourly traffic data from the I-94 Interstate to understand what factors influence heavy traffic. My goal was to identify patterns based on weather, time of day, and day of the week — and to practice building insights through exploratory data analysis.

The traffic data focuses on the westbound direction, from Saint Paul to Minneapolis.


🎯 Objective¶

I wanted to investigate whether traffic volume is affected by external factors like seasonal changes (e.g., summer vs. winter) and weather conditions (e.g., snow or rain).

The I-94 Traffic Dataset¶

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots 
import plotly.io as pio
pio.renderers.default = 'notebook_connected'  # or 'iframe_connected' if you want isolation

i_94_traffic = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
In [173]:
i_94_traffic.head(5)
Out[173]:
holiday temp rain_1h snow_1h clouds_all weather_main weather_description date_time traffic_volume
0 NaN 288.28 0.0 0.0 40 Clouds scattered clouds 2012-10-02 09:00:00 5545
1 NaN 289.36 0.0 0.0 75 Clouds broken clouds 2012-10-02 10:00:00 4516
2 NaN 289.58 0.0 0.0 90 Clouds overcast clouds 2012-10-02 11:00:00 4767
3 NaN 290.13 0.0 0.0 90 Clouds overcast clouds 2012-10-02 12:00:00 5026
4 NaN 291.14 0.0 0.0 75 Clouds broken clouds 2012-10-02 13:00:00 4918
In [174]:
i_94_traffic.tail(5)
Out[174]:
holiday temp rain_1h snow_1h clouds_all weather_main weather_description date_time traffic_volume
48199 NaN 283.45 0.0 0.0 75 Clouds broken clouds 2018-09-30 19:00:00 3543
48200 NaN 282.76 0.0 0.0 90 Clouds overcast clouds 2018-09-30 20:00:00 2781
48201 NaN 282.73 0.0 0.0 90 Thunderstorm proximity thunderstorm 2018-09-30 21:00:00 2159
48202 NaN 282.09 0.0 0.0 90 Clouds overcast clouds 2018-09-30 22:00:00 1450
48203 NaN 282.12 0.0 0.0 90 Clouds overcast clouds 2018-09-30 23:00:00 954
In [175]:
i_94_traffic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   holiday              61 non-null     object 
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64  
 5   weather_main         48204 non-null  object 
 6   weather_description  48204 non-null  object 
 7   date_time            48204 non-null  object 
 8   traffic_volume       48204 non-null  int64  
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB

📂 Dataset Overview¶

The dataset contains 48,204 rows and 9 columns. Every column is complete except holiday, which is populated in only 61 rows. Each row captures weather and traffic data for a specific hour.

The time range spans from 2012-10-02 09:00:00 to 2018-09-30 23:00:00.
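The figures in this overview can be recomputed directly rather than read off the info() printout. The sketch below is only an illustration: `sample` is a hypothetical three-row stand-in for the real CSV, and its values are made up so the block runs on its own.

```python
import pandas as pd

# Hypothetical three-row stand-in for the real CSV (illustrative values only).
sample = pd.DataFrame({
    "holiday": [None, "Columbus Day", None],
    "date_time": ["2012-10-02 09:00:00", "2012-10-08 00:00:00", "2012-10-08 01:00:00"],
    "traffic_volume": [5545, 455, 336],
})

# Parse timestamps so min/max compare chronologically rather than as strings.
sample["date_time"] = pd.to_datetime(sample["date_time"])

n_rows, n_cols = sample.shape
missing_per_column = sample.isna().sum()
first_hour = sample["date_time"].min()
last_hour = sample["date_time"].max()

print(n_rows, n_cols)
print(missing_per_column)
print(first_hour, "to", last_hour)
```

Running the same three calls on i_94_traffic should reproduce the row count, the holiday gap, and the time range quoted above.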

🚦 Understanding Traffic Volume¶

In [176]:
# Pass the column name as x and let Plotly Express look it up in the frame
fig = px.histogram(i_94_traffic, x='traffic_volume', labels={'traffic_volume': 'Traffic Volume'}, title='Traffic Volume Distribution', nbins=10, text_auto=True, width=600)
fig.show()
In [177]:
i_94_traffic['traffic_volume'].describe()
Out[177]:
count    48204.000000
mean      3259.818355
std       1986.860670
min          0.000000
25%       1193.000000
50%       3380.000000
75%       4933.000000
max       7280.000000
Name: traffic_volume, dtype: float64

To begin, I visualized the distribution of the traffic_volume column using a histogram. Here's a quick statistical summary:

  • Minimum: 0

  • Maximum: 7,280

  • Mean: ~3,260

  • 25th percentile: ~1,193

  • 75th percentile: ~4,933

Traffic volume varies widely: a quarter of all hours saw 1,193 vehicles or fewer, while another quarter saw 4,933 or more. That spread suggests distinct low-volume and high-volume periods, which led me to compare daytime and nighttime traffic.

🌙 Day vs. Night Traffic¶

I split the data into:

  • Daytime: 07:00 to 19:00 (7 AM to 7 PM)

  • Nighttime: 19:00 to 07:00 (7 PM to 7 AM)

In [178]:
i_94_traffic['date_time'] = pd.to_datetime(i_94_traffic['date_time'])

hour = i_94_traffic['date_time'].dt.hour

# Filter first, then copy, so columns added later do not touch the original frame
daytime_traffic = i_94_traffic[(hour >= 7) & (hour < 19)].copy()
nighttime_traffic = i_94_traffic[(hour >= 19) | (hour < 7)].copy()
print(f"Day Time Shape: {daytime_traffic.shape},\nNight Time Shape: {nighttime_traffic.shape}")
Day Time Shape: (23877, 9),
Night Time Shape: (24327, 9)
In [179]:
i_94_traffic.iloc[176:178]
Out[179]:
holiday temp rain_1h snow_1h clouds_all weather_main weather_description date_time traffic_volume
176 NaN 281.17 0.0 0.0 90 Clouds overcast clouds 2012-10-10 03:00:00 361
177 NaN 281.25 0.0 0.0 92 Clear sky is clear 2012-10-10 06:00:00 5875
  • Daytime rows: 23,877

  • Nighttime rows: 24,327

The two subsets are not the same size (a difference of 450 rows) because the hourly record is not complete. Rows 176 and 177 above show one such gap: the data jumps from 03:00 to 06:00 on 2012-10-10, so the 04:00 and 05:00 readings are missing.
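Gaps like the one between rows 176 and 177 can be located programmatically instead of by eye. This is a minimal sketch on hypothetical timestamps (`stamps` is made up for the example); it assumes the readings are sorted by time and nominally one hour apart.

```python
import pandas as pd

# Hypothetical hourly timestamps with the 04:00 and 05:00 readings missing.
stamps = pd.Series(pd.to_datetime([
    "2012-10-10 02:00:00",
    "2012-10-10 03:00:00",
    "2012-10-10 06:00:00",
    "2012-10-10 07:00:00",
]))

# Time elapsed between each reading and the previous one (NaT for the first row).
step = stamps.diff()
one_hour = pd.Timedelta(hours=1)

# A step longer than one hour marks a gap; each row kept is the first reading after it.
gaps = (
    pd.DataFrame({"resumes_at": stamps, "gap": step})
    .loc[step > one_hour]
    .assign(missing_hours=lambda d: (d["gap"] / one_hour).astype(int) - 1)
)
print(gaps)
```

Applied to i_94_traffic['date_time'], the same diff() call would list every place where the series skips ahead. Summing missing_hours then gives a rough count of unrecorded hours.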

📊 Traffic Volume Distribution¶

In [180]:
fig = make_subplots(rows=1,cols=2,column_widths=[.50,.50])
# fig = go.Figure()
traffic_volume_day_plot = go.Histogram(x=daytime_traffic['traffic_volume'],nbinsx=10,name='Day',)

traffic_volume_night_plot = go.Histogram(x=nighttime_traffic['traffic_volume'],nbinsx=10,name='Night')

fig.add_trace(traffic_volume_day_plot,1,1)
fig.add_trace(traffic_volume_night_plot,1,2)
fig.update_layout(xaxis1_title_text='Traffic Volume',yaxis1_title_text='Frequency',xaxis2_title_text='Traffic Volume',yaxis2_title_text='Frequency', title={'text':"Traffic Volume Comparison: Day Vs Night",'x':0.5}, width=1000)
fig.show()
In [181]:
daytime_traffic['traffic_volume'].describe()
Out[181]:
count    23877.000000
mean      4762.047452
std       1174.546482
min          0.000000
25%       4252.000000
50%       4820.000000
75%       5559.000000
max       7280.000000
Name: traffic_volume, dtype: float64
In [182]:
nighttime_traffic['traffic_volume'].describe()
Out[182]:
count    24327.000000
mean      1785.377441
std       1441.951197
min          0.000000
25%        530.000000
50%       1287.000000
75%       2819.000000
max       6386.000000
Name: traffic_volume, dtype: float64

From the histograms:

  • Daytime traffic is mostly higher, with a peak around 4,000–5,000

  • Nighttime traffic tends to be lower, with many hours under 2,000

The average hourly volume during the day is about 4,762 vehicles, while at night it drops to around 1,785.

This confirmed my hypothesis: traffic is much heavier during the day.
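The day and night statistics can also be put side by side in one table by labelling each row rather than describing two separate frames. The sketch below uses a hypothetical four-row `sample` and a hypothetical `period` column; the 07:00 and 19:00 cut-offs follow the split defined earlier.

```python
import pandas as pd

# Hypothetical readings on either side of the 07:00 and 19:00 boundaries.
sample = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2012-10-02 06:00:00", "2012-10-02 07:00:00",
        "2012-10-02 18:00:00", "2012-10-02 19:00:00",
    ]),
    "traffic_volume": [1000, 6000, 5000, 3000],
})

hour = sample["date_time"].dt.hour

# 07:00 is included in "day", 19:00 is not, matching the split used above.
is_day = (hour >= 7) & (hour < 19)
sample["period"] = is_day.map({True: "day", False: "night"})

stats_by_period = sample.groupby("period")["traffic_volume"].agg(["mean", "median", "count"])
print(stats_by_period)
```

One table makes it harder to mix up the mean with a quartile when quoting the numbers.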

🕒 Time-Based Patterns¶

Next, I explored how traffic changes across different time dimensions — starting with month, then day of the week, and finally hour of the day.

📅 By Month¶

In [183]:
daytime_traffic['month'] = daytime_traffic['date_time'].dt.month
daytime_traffic_by_month = daytime_traffic.groupby('month').mean(numeric_only=True)
fig = px.line(daytime_traffic_by_month,y=daytime_traffic_by_month['traffic_volume'],x=daytime_traffic_by_month.index, width=600,title="Month Indicator Plot")
fig.update_traces(textposition="bottom right")
fig.show()
In [184]:
daytime_traffic['year'] = daytime_traffic['date_time'].dt.year
daytime_traffic_in_july = daytime_traffic[daytime_traffic['month'] == 7 ]
daytime_traffic_in_july = daytime_traffic_in_july.groupby('year').mean(numeric_only=True)
fig = px.line(daytime_traffic_in_july,y=daytime_traffic_in_july['traffic_volume'],x=daytime_traffic_in_july.index, width=600,title="Average Daytime Traffic in July by Year")
fig.update_traces(textposition="bottom right")
fig.show()

Adding a month column and plotting the mean traffic by month showed a few things:

  • Lower traffic in the coldest months: the first two of the year (January–February) and the last two (November–December)

  • Most months show higher volume — except for July, which dips unexpectedly

Digging deeper, I found that July 2016 had an especially low average, possibly because of major road construction: I-94 lane closures were reported that year.

Lane closure article: "I-696 closure, I-96/US-23 bridge work, I-94 lane closures"
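A month-by-year table is one way to spot which year pulls a monthly average down, without plotting each month separately. The sketch below runs on a hypothetical `sample` with made-up volumes; only the column names match the notebook.

```python
import pandas as pd

# Hypothetical daytime readings for two Julys and one August (illustrative values only).
sample = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2015-07-01 08:00:00", "2015-07-02 08:00:00",
        "2016-07-01 08:00:00", "2016-07-02 08:00:00",
        "2016-08-01 08:00:00",
    ]),
    "traffic_volume": [5000, 5200, 3900, 4100, 5100],
})
sample["year"] = sample["date_time"].dt.year
sample["month"] = sample["date_time"].dt.month

# Rows are months, columns are years, cells hold the mean traffic volume.
month_by_year = sample.pivot_table(
    index="month", columns="year", values="traffic_volume", aggfunc="mean"
)
print(month_by_year)
```

Cells with no data come out as NaN, which also shows at a glance which month and year combinations are absent from the record.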

📆 By Day of the Week¶

In [185]:
daytime_traffic['dayofweek'] = daytime_traffic['date_time'].dt.dayofweek
traffic_by_dayofweek = daytime_traffic.groupby('dayofweek').mean(numeric_only=True)
fig = px.line(traffic_by_dayofweek,y=traffic_by_dayofweek['traffic_volume'],x=traffic_by_dayofweek.index, width=600,title="Traffic Each Day of The Week")
fig.update_traces(textposition="bottom right")
fig.show()

Grouping by dayofweek (where Monday = 0), I found:

  • Weekdays (Monday to Friday) consistently have higher traffic

  • Weekends show a clear drop in average volume

This makes sense given work commute patterns.
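Since dayofweek is a bare number, it can help to attach the day name and a weekend flag before grouping. A minimal sketch, assuming a hypothetical `sample` frame with made-up volumes; `day_name` and `is_weekend` are column names introduced only for this example.

```python
import pandas as pd

# Hypothetical readings: a Monday, a Saturday and a Sunday (illustrative values only).
sample = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2018-09-24 08:00:00", "2018-09-29 08:00:00", "2018-09-30 08:00:00",
    ]),
    "traffic_volume": [6000, 2500, 2000],
})

sample["dayofweek"] = sample["date_time"].dt.dayofweek  # Monday = 0 ... Sunday = 6
sample["day_name"] = sample["date_time"].dt.day_name()
sample["is_weekend"] = sample["dayofweek"] >= 5

# Mean volume for weekdays (False) and weekends (True).
mean_by_day_type = sample.groupby("is_weekend")["traffic_volume"].mean()
print(sample[["day_name", "is_weekend"]])
print(mean_by_day_type)
```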

⏰ By Hour (Weekdays vs. Weekends)¶

Because weekends pull the hourly averages down, I looked at business days and weekends separately:

In [186]:
# Extract hour
daytime_traffic['hour'] = daytime_traffic['date_time'].dt.hour

# Separate weekdays (Mon–Fri) and weekends (Sat–Sun)
business_days = daytime_traffic[daytime_traffic['dayofweek'] <= 4]
weekend = daytime_traffic[daytime_traffic['dayofweek'] >= 5]

# Group by hour
by_hour_business = business_days.groupby('hour').mean(numeric_only=True)
by_hour_weekend = weekend.groupby('hour').mean(numeric_only=True)

# Create subplots
fig = make_subplots(rows=1, cols=2, column_widths=[0.5, 0.5])

# Line plots using Scatter with mode='lines'
traffic_volume_weekday_plot = go.Scatter(
    x=by_hour_business.index,
    y=by_hour_business['traffic_volume'],
    mode='lines',
    name='Weekday'
)

traffic_volume_weekend_plot = go.Scatter(
    x=by_hour_weekend.index,
    y=by_hour_weekend['traffic_volume'],
    mode='lines',
    name='Weekend'
)

# Add traces
fig.add_trace(traffic_volume_weekday_plot, row=1, col=1)
fig.add_trace(traffic_volume_weekend_plot, row=1, col=2)

# Update layout
fig.update_layout(
    title={'text': 'Traffic Volume by Hour: Weekday vs Weekend', 'x': 0.5},
    width=1000,
    height=400,
    xaxis1_title='Hour of Day',
    yaxis1_title='Traffic Volume',
    xaxis2_title='Hour of Day',
    yaxis2_title='Traffic Volume'
)

fig.show()

When plotted:

  • Business days have two clear peaks: around 07:00 and 16:00, aligning with morning and evening commutes

  • Weekends have flatter, lower traffic throughout the day
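The peak hours can be read from the grouped data as well as from the plot. The sketch below uses a hypothetical `by_hour` series whose values are invented to have the same two-peak shape. Splitting the day at noon is an assumption made for the example.

```python
import pandas as pd

# Hypothetical hourly means for business days (illustrative numbers, not the real results).
by_hour = pd.Series(
    [6000, 5500, 4900, 4400, 4600, 4800, 5100, 5600, 5800, 6200, 5800, 4400],
    index=pd.Index(range(7, 19), name="hour"),
    name="traffic_volume",
)

# .loc slices by label and includes both end points.
morning = by_hour.loc[7:11]
afternoon = by_hour.loc[12:18]

# idxmax returns the index label (the hour) of the largest value.
morning_peak_hour = morning.idxmax()
afternoon_peak_hour = afternoon.idxmax()
print(morning_peak_hour, afternoon_peak_hour)
```

With the real data, by_hour_business['traffic_volume'] would take the place of `by_hour`.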

🌦️ Weather and Traffic

In [187]:
daytime_traffic.columns
Out[187]:
Index(['holiday', 'temp', 'rain_1h', 'snow_1h', 'clouds_all', 'weather_main',
       'weather_description', 'date_time', 'traffic_volume', 'month', 'year',
       'dayofweek', 'hour'],
      dtype='object')
In [188]:
daytime_traffic.info()
<class 'pandas.core.frame.DataFrame'>
Index: 23877 entries, 0 to 48198
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   holiday              0 non-null      object        
 1   temp                 23877 non-null  float64       
 2   rain_1h              23877 non-null  float64       
 3   snow_1h              23877 non-null  float64       
 4   clouds_all           23877 non-null  int64         
 5   weather_main         23877 non-null  object        
 6   weather_description  23877 non-null  object        
 7   date_time            23877 non-null  datetime64[ns]
 8   traffic_volume       23877 non-null  int64         
 9   month                23877 non-null  int32         
 10  year                 23877 non-null  int32         
 11  dayofweek            23877 non-null  int32         
 12  hour                 23877 non-null  int32         
dtypes: datetime64[ns](1), float64(3), int32(4), int64(2), object(3)
memory usage: 2.2+ MB
In [189]:
daytime_traffic[['temp', 'rain_1h', 'snow_1h', 'clouds_all', 'date_time', 'traffic_volume', 'month', 'year',
       'dayofweek', 'hour']].corr()['traffic_volume']
Out[189]:
temp              0.128317
rain_1h           0.003697
snow_1h           0.001265
clouds_all       -0.032932
date_time        -0.007153
traffic_volume    1.000000
month            -0.022337
year             -0.003557
dayofweek        -0.416453
hour              0.172704
Name: traffic_volume, dtype: float64
In [190]:
# Create subplots
fig = make_subplots(rows=1, cols=2, column_widths=[0.5, 0.5])

# Left: Outlier Present
fig.add_trace(
    go.Scatter(
        x=daytime_traffic['traffic_volume'],
        y=daytime_traffic['temp'],
        mode='markers',
        name='Outlier Present'
    ),
    row=1, col=1
)

# Right: same data; the outliers are hidden by the y-axis range set below
fig.add_trace(
    go.Scatter(
        x=daytime_traffic['traffic_volume'],
        y=daytime_traffic['temp'],
        mode='markers',
        name='Outlier Absent'
    ),
    row=1, col=2
)

# Update layout and axis titles
fig.update_layout(
    title={
        'text': 'Traffic Volume Vs Temperature',
        'x': 0.5,
        'xanchor': 'center'
    },
    width=1200,
    height=400,
)

# Shared axis labels
fig.update_xaxes(title_text="Traffic Volume", row=1, col=1)
fig.update_yaxes(title_text="Temperature", row=1, col=1)

fig.update_xaxes(title_text="Traffic Volume", row=1, col=2)
fig.update_yaxes(title_text="Temperature", row=1, col=2, range=[230, 320])  # Outlier Absent

fig.show()

The Traffic Volume Vs Temperature graphs show that temperature is not a suitable indicator of traffic volume, which matches its weak correlation of about 0.13.

It is best to explore the other weather-related columns.
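Instead of hiding outliers with an axis range, the implausible temperatures can be filtered out before the correlation is computed. The sketch below assumes the temp column is in Kelvin, which the values around 288 suggest, and reuses the 230 to 320 range from the right-hand plot. `sample` and `temp_c` are hypothetical names with made-up values.

```python
import pandas as pd

# Hypothetical daytime readings; the 0.0 row mimics a faulty temperature value.
sample = pd.DataFrame({
    "temp": [288.28, 0.0, 275.15, 301.40, 262.00],
    "traffic_volume": [5545, 4800, 4300, 5100, 4700],
})

# Keep only temperatures inside the range shown in the right-hand plot.
plausible = sample[sample["temp"].between(230, 320)].copy()

# Kelvin to Celsius, purely for readability.
plausible["temp_c"] = plausible["temp"] - 273.15

# Pearson correlation before and after removing the outlier.
corr_all = sample["temp"].corr(sample["traffic_volume"])
corr_plausible = plausible["temp"].corr(plausible["traffic_volume"])
print(round(corr_all, 3), round(corr_plausible, 3))
```

Comparing the two coefficients on the real data would show whether the outliers change the conclusion at all.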

Weather Types¶

In [191]:
# Group and aggregate the data
traffic_by_weather_main = daytime_traffic.groupby('weather_main').mean(numeric_only=True)

# Create a horizontal bar chart
fig = go.Figure(
    go.Bar(
        x=traffic_by_weather_main['traffic_volume'],
        y=traffic_by_weather_main.index,
        orientation='h'
    )
)

# Set title, axis labels, and figure size
fig.update_layout(
    title='Traffic Volume Vs Weather Type',
    xaxis_title='Traffic Volume',
    yaxis_title='Weather Type',
    height=500,
    width=700
)

fig.show()
In [192]:
traffic_by_weather_main.describe()
Out[192]:
temp rain_1h snow_1h clouds_all traffic_volume month year dayofweek hour
count 11.000000 11.000000 11.000000 11.000000 11.000000 11.000000 11.000000 11.000000 11.000000
mean 283.735656 0.696040 0.000390 64.851857 4611.190225 6.662984 2015.723636 2.778513 12.377507
std 8.512431 1.170541 0.000648 22.809767 212.710591 0.378205 0.274677 0.317491 0.983460
min 267.984505 0.000000 0.000000 1.670265 4211.000000 5.832134 2015.321420 2.000000 10.325967
25% 278.500233 0.027026 0.000000 63.333774 4480.452258 6.441921 2015.542564 2.752270 12.230705
50% 283.812078 0.170804 0.000000 74.961435 4623.976475 6.734285 2015.619429 2.895102 12.467626
75% 289.747717 0.949167 0.000559 75.527076 4796.992361 6.916667 2015.899443 2.944984 12.802994
max 296.730000 3.972943 0.001768 84.704417 4865.415996 7.108647 2016.261641 3.138928 14.000000

Weather Description¶

In [193]:
# Group by weather description and compute average traffic volume
traffic_by_weather_description = daytime_traffic.groupby('weather_description').mean(numeric_only=True)

# Optional: sort for better readability
traffic_by_weather_description = traffic_by_weather_description.sort_values('traffic_volume', ascending=True)

# Create horizontal bar chart
fig = go.Figure(
    go.Bar(
        x=traffic_by_weather_description['traffic_volume'],
        y=traffic_by_weather_description.index,
        orientation='h'
    )
)

# Set title, axis labels, and figure size
fig.update_layout(
    title='Traffic Volume Vs Weather Description',
    xaxis_title='Traffic Volume',
    yaxis_title='Weather Description',
    height=800,
    width=800,
    margin=dict(l=150)  # Add more left margin if labels are long
)

fig.show()

The dataset includes several weather-related columns:

  • temp

  • rain_1h

  • snow_1h

  • clouds_all

  • weather_main

  • weather_description

Looking at average traffic volume by weather description, I found three weather types that stood out — all with over 5,000 vehicles/hour on average:

  • Shower snow

  • Light rain and snow

  • Proximity thunderstorm with drizzle

At first this seemed odd — why would bad weather increase traffic? My hypothesis is that during unpleasant but not extreme weather, people are more likely to drive instead of biking or walking, which results in a bump in car usage.
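Before leaning on this hypothesis, it is worth checking how many hourly rows sit behind each average, since a rare weather description can produce a high mean from just a few readings. The sketch below is an illustration on a hypothetical `sample`; the `MIN_HOURS` threshold is arbitrary and the volumes are made up.

```python
import pandas as pd

# Hypothetical daytime rows (illustrative values only).
sample = pd.DataFrame({
    "weather_description": [
        "sky is clear", "sky is clear", "sky is clear", "shower snow", "light rain",
    ],
    "traffic_volume": [4800, 4600, 5000, 5600, 4700],
})

# Mean volume together with the number of hourly rows behind each mean.
summary = (
    sample.groupby("weather_description")["traffic_volume"]
    .agg(mean_volume="mean", n_hours="count")
    .sort_values("mean_volume", ascending=False)
)

# Flag descriptions whose mean rests on very few observations.
MIN_HOURS = 3  # arbitrary threshold chosen for this toy example
summary["too_few_rows"] = summary["n_hours"] < MIN_HOURS
print(summary)
```

If the three standout descriptions turn out to have only a handful of rows each, their high averages say little about driver behaviour.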

✅ Final Summary¶

This analysis revealed two main types of indicators for high traffic:

🔹 Time-Based Indicators¶

  • Heavier traffic during warmer months (March–October)

  • Weekdays are busier than weekends

  • Rush hour peaks at around 07:00 and 16:00

🔹 Weather-Based Indicators¶

  • Some moderately bad weather conditions lead to higher traffic volumes

  • Possibly because people prefer using personal vehicles during such weather


This project helped me practice:

  • Working with time series data

  • Cleaning and transforming datetime features

  • Visualizing distributions and trends

  • Formulating and testing data-driven hypotheses

I'm excited to keep building on these skills as I explore more complex datasets and start incorporating machine learning into my projects!